Text Mining: Clustering Merchant Codes and Categories


Clustering Merchant Codes and Categories

Merchant Category Codes (MCC) are an ISO standard that groups every merchant accepting credit cards into roughly 1,000 categories, each with a short description (text!).

A thousand merchant types, however, is too many to be practical when analyzing credit card spending.

We will try to use text mining to reduce that number.

For clarity and brevity, we leave out the code that prepares the data. That code:

  • Loads the data
  • Applies regular expressions to clean the text a little
  • Extracts the tokens (terms)
  • Removes stopwords
  • Replaces each term with its root (stemming)
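
The preparation code in `prep.helpers` is not shown, so the following is only a minimal sketch of the steps listed above. The stopword list and the `toy_stem` suffix stripper are hypothetical stand-ins chosen to keep the example dependency-free; a real pipeline would more likely use a full stopword list and a proper stemmer such as Porter or Snowball.

```python
import re

# Hypothetical, deliberately tiny stopword list (assumption, not from the source).
STOPWORDS = {"and", "the", "of", "or", "for", "in", "not"}

# Toy suffix stripper standing in for a real stemmer (assumption).
SUFFIXES = ("ing", "es", "s")


def toy_stem(token):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Clean, tokenize, drop stopwords and stem a single description."""
    # Lowercase and replace anything that is not a letter or whitespace.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenize on whitespace and drop single-letter leftovers such as "a", "c".
    tokens = [t for t in text.split() if len(t) > 1]
    # Remove stopwords, then stem what is left.
    tokens = [toy_stem(t) for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)


print(preprocess("Landscaping Services"))
print(preprocess("Heating, Plumbing, and A/C"))
```

The `desc` column in the notebook also appears to sort the terms and expand abbreviations such as "A/C"; this sketch does neither.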

In [1]:
from prep import helpers
from prep import clustering

In [2]:
df = helpers.get_mcc_data()

In [3]:
df.head()[['mcc', 'irs_description','desc']]


Out[3]:
   mcc   irs_description           desc
0  742   Veterinary Services       servic veterinari
1  763   Agricultural Cooperative  agricultur cooper
2  780   Landscaping Services      landscap servic
3  1520  General Contractors       contractor gener
4  1711  Heating, Plumbing, A/C    air condit heat plumb

Remember this little machine?


In [4]:
X, vectorizer = clustering.get_tfidf(df.desc)

In [5]:
print(df.shape)
print(X.shape)


(981, 9)
(981, 1081)

What happened? Our super vectorizer:

  • Extracts the tokens (terms)
  • Removes stopwords
  • Replaces each term with its root (stemming)
  • Converts each document into a vector of terms
  • Builds a document-term matrix holding the TF-IDF of every (document, term) pair.
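
The internals of `clustering.get_tfidf` are not shown. As a rough sketch of the last two steps, scikit-learn's `TfidfVectorizer` can build such a matrix directly. The `ngram_range=(1, 2)` setting is an assumption, suggested by feature names like `'agricultur cooper'` in the notebook output, and the three documents are a toy sample.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sample of already-stemmed descriptions (illustrative only).
docs = ["servic veterinari", "agricultur cooper", "landscap servic"]

# Assumption: unigrams plus bigrams, default L2 row normalization.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)  # sparse document-term matrix

# vocabulary_ maps each term to its column index in X.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(X.shape)
print(terms)
print(X[1].toarray().round(4))
```

In this toy corpus the second document has three terms with identical weights, so after L2 normalization each one equals 1/sqrt(3), about 0.5774. Note that recent scikit-learn releases expose the feature names through `get_feature_names_out()` rather than `get_feature_names()`.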

A look at the famous document-term matrix:


In [6]:
X[1,:25]


Out[6]:
<1x25 sparse matrix of type '<class 'numpy.float64'>'
	with 2 stored elements in Compressed Sparse Row format>

What? ...

  • the descriptions use only a few words, so most columns are 0
  • it is computationally more efficient to store only the non-zero values (see sparse matrix)
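
To make the storage argument concrete, here is a small sketch using SciPy's `csr_matrix`. It rebuilds a row shaped like the one above (25 columns, two non-zero entries) and compares the bytes used by each representation. The values are copied by hand for illustration; this is not the notebook's actual matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A 1x25 row with only two non-zero entries, as in the output above.
dense = np.zeros((1, 25))
dense[0, 23] = dense[0, 24] = 0.57735027

sparse = csr_matrix(dense)

# CSR stores only the non-zero values, their column indices and row pointers.
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print(sparse.nnz)                  # number of stored elements
print(sparse_bytes, dense.nbytes)  # sparse vs. dense storage, in bytes

# Converting back recovers the full row, zeros included.
restored = sparse.toarray()
```

The saving grows with the matrix: at 981 x 1081 with only a handful of terms per description, almost every cell would be a stored zero in the dense form.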

In [7]:
X[1,:25].todense()


Out[7]:
matrix([[0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.57735027, 0.57735027]])

In [8]:
vectorizer.get_feature_names()[23:25]


Out[8]:
['agricultur', 'agricultur cooper']

So far we have seen the text part. What about the mining part?

Dimensionality reduction

K-Means Clustering


Which boils down to...


In [10]:
# Reduce the data to 50 dimensions and build 40 clusters
clusters = clustering.get_clusters(X, k_choices=[40], dim_choices=[50])


Explained variance with 50 dimensions: 73%
Clustering quality metric with 50 dimensions and 40 clusters: 0.7037

...simple, right?

You can easily experiment with other combinations of K and number of dimensions...
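
The body of `clustering.get_clusters` is not shown, so the following is only a guess at the kind of pipeline it might wrap: `TruncatedSVD` for the dimensionality reduction, `KMeans` for the clustering, and a loop over the candidate values. The silhouette score is used here as the quality metric purely as an assumption; the source does not say which metric it reports. The corpus is a toy sample, so the dimensions and K values are much smaller than 50 and 40.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Toy stemmed descriptions (illustrative, not the real MCC data).
docs = [
    "airlin carrier", "air airlin carrier", "airlin",
    "hotel motel resort", "hotel inn", "motel resort",
    "car rental agenc", "car rental",
]
X = TfidfVectorizer().fit_transform(docs)

results = {}
for n_dims in (2, 3):          # plays the role of dim_choices
    svd = TruncatedSVD(n_components=n_dims, random_state=0)
    X_reduced = svd.fit_transform(X)
    explained = svd.explained_variance_ratio_.sum()
    for k in (2, 3):           # plays the role of k_choices
        km = KMeans(n_clusters=k, n_init=10, random_state=0)
        labels = km.fit_predict(X_reduced)
        # Assumption: silhouette score as the clustering quality metric.
        score = silhouette_score(X_reduced, labels)
        results[(n_dims, k)] = (explained, score, labels)
        print(f"dims={n_dims} k={k} explained={explained:.0%} score={score:.4f}")
```

Comparing the printed rows is one simple way to pick a combination: more dimensions retain more variance, while the score indicates how cleanly the points separate for a given K.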

Finally:

  • We determine the most important dimensions of each cluster
  • From these, we look up the words associated with those dimensions to form the topics
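
How `clustering.add_cluster_descriptions` does this is not shown. One common way to get from clusters back to words, sketched below under that caveat, is to map each K-Means centroid from the reduced space back to term space with `TruncatedSVD.inverse_transform` and keep the highest-weighted terms. The toy corpus and the choice of three terms per cluster are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stemmed descriptions (illustrative, not the real MCC data).
docs = [
    "airlin carrier", "air airlin carrier", "airlin",
    "hotel motel resort", "hotel inn", "motel resort",
    "car rental agenc", "car rental",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

svd = TruncatedSVD(n_components=3, random_state=0)
X_reduced = svd.fit_transform(X)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_reduced)

# vocabulary_ maps term -> column index; order the terms by column.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Project each centroid from the reduced space back to term space.
centroids_in_term_space = svd.inverse_transform(km.cluster_centers_)

descriptions = {}
for cluster_id, weights in enumerate(centroids_in_term_space):
    top = np.argsort(weights)[::-1][:3]  # indices of the 3 heaviest terms
    descriptions[cluster_id] = ", ".join(terms[i] for i in top)

print(descriptions)
```

The resulting strings have the same comma-separated shape as the `cluster_description` values in the notebook, and could be joined back onto the data frame by cluster id.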

In [11]:
rs_df = clustering.add_cluster_descriptions(df, clusters, vectorizer)
rs_df.groupby('cluster_description').mcc.count()


Out[11]:
cluster_description
agenc, detect, agenc detect, agenc car, travel                         5
airlin, airlin carrier, air airlin, carrier, air                     300
car rental, car, rental, agenc, agenc car                             93
care, travel, child, associ, child servic                              9
cash disbur, cash, disbur, autom, video                                9
clean, laundri, specialti, clean specialti, clean mainten              5
cloth, men, cloth men, men store, store                                4
club, order, book, money, record                                       8
commerci, equip, commerci equip, footwear, commerci footwear           9
comput, softwar, comput repair, comput program, program                5
contractor, gener, electr, contractor gener, contractor electr         5
dealer, dealer motorcycl, motorcycl, motorcycl shop, boat              8
direct, market, direct market, market merchant, merchant               8
florist, air, club, line, condit                                       8
good, digit, digit good, store, good nondur                           12
govern, govern servic, govern lotteri, lotteri, licen                  8
home, dealer, dealer home, home store, suit                            4
hotel inn, inn motel, resort, motel resort, motel                    291
insur, premium, insur premium, underwrit, premium underwrit            3
jewelri, watch, metal, repair watch, jewelri repair                    4
laundri, autom, fuel, dispen fuel, dispen                              9
line, florist, recreat, crui, crui line                               11
miscellan, gener, specialti, miscellan servic, servic                 10
optician, eyeglass, eyeglass optician, optic optician, good optic      5
organ, religi, organ religi, membership organ, membership              4
park, amu, trailer, carniv, carniv park                                5
photograph, studio, photograph studio, equip, suppli                   4
place, restaur, drink, drink place, eat                                5
product, pictur, petroleum, petroleum product, video                   6
public, relat, transport, public relat, consult public                 4
repair, repair shop, shop, electron, repair weld                       8
sale, pool, door sale, door, telecommun                                5
school, correspond, correspond school, secretari, busi                 5
servic, adverti, adverti servic, servic telegraph, telegraph          28
shop, servic shop, antiqu, auto, antiqu shop                          14
store, accessori, automot, women, store wear                           6
store, shoe, record store, record, shoe store                         28
suppli, store suppli, store, home store, home                         10
truck, part, lea, sale servic, servic                                  8
util, medic, hospit, dental, equip                                     8
Name: mcc, dtype: int64

In [13]:
rs_df.head(20)[['mcc', 'cluster_id', 'irs_description', 'cluster_description']]


Out[13]:
    mcc   cluster_id  irs_description                                  cluster_description
0   742   5           Veterinary Services                              servic, adverti, adverti servic, servic telegr...
1   763   10          Agricultural Cooperative                         cash disbur, cash, disbur, autom, video
2   780   5           Landscaping Services                             servic, adverti, adverti servic, servic telegr...
3   1520  27          General Contractors                              contractor, gener, electr, contractor gener, c...
4   1711  6           Heating, Plumbing, A/C                           florist, air, club, line, condit
5   1731  27          Electrical Contractors                           contractor, gener, electr, contractor gener, c...
6   1740  33          Masonry, Stonework, and Plaster                  util, medic, hospit, dental, equip
7   1750  27          Carpentry Contractors                            contractor, gener, electr, contractor gener, c...
8   1761  26          Roofing/Siding, Sheet Metal                      jewelri, watch, metal, repair watch, jewelri r...
9   1771  27          Concrete Work Contractors                        contractor, gener, electr, contractor gener, c...
10  1799  27          Special Trade Contractors                        contractor, gener, electr, contractor gener, c...
11  2741  16          Miscellaneous Publishing and Printing            miscellan, gener, specialti, miscellan servic,...
12  2791  28          Typesetting, Plate Making, and Related Services  public, relat, transport, public relat, consul...
13  2842  19          Specialty Cleaning                               clean, laundri, specialti, clean specialti, cl...
14  3000  1           Airlines                                         airlin, airlin carrier, air airlin, carrier, air
15  3001  1           Airlines                                         airlin, airlin carrier, air airlin, carrier, air
16  3002  1           Airlines                                         airlin, airlin carrier, air airlin, carrier, air
17  3003  1           Airlines                                         airlin, airlin carrier, air airlin, carrier, air
18  3004  1           Airlines                                         airlin, airlin carrier, air airlin, carrier, air
19  3005  1           Airlines                                         airlin, airlin carrier, air airlin, carrier, air

Thank you!